2. Definitions

  • intrinsic vs. post-hoc

  • global vs. local explanation

  • workflow

Learning outcomes

  1. Compare the competing definitions of interpretable machine learning, the motivations behind them, and metrics that can be used to quantify whether they have been met.

Reading

  • Lipton, Z. C. (2018). The Mythos of Model Interpretability. ACM Queue: Tomorrow’s Computing Today, 16(3), 31–57. https://doi.org/10.1145/3236386.3241340

  • Murdoch, W. J., Singh, C., Kumbier, K., Abbasi-Asl, R., & Yu, B. (2019). Definitions, methods, and applications in interpretable machine learning. Proceedings of the National Academy of Sciences of the United States of America, 116(44), 22071–22080. https://doi.org/10.1073/pnas.1900654116

Intrinsic Interpretability vs. Post-hoc Explanations

  • Intrinsically interpretable: Build a “glass box” from the start. The model is interpretable by design — its structure allows us to understand how it works.

  • Post-hoc: Inspect an already trained “black box” model, which can be chosen simply to maximize accuracy without regard to interpretability. Post-hoc methods extract explanations from models that weren’t designed to be understood.

Intrinsic interpretability

Some properties that make a model intrinsically interpretable are:

  • Sparsity
  • Simulatability
  • Modularity

We define these on the next few slides.

Sparsity

  • A model is sparse if the number of non-zero parameters is small relative to the total number of available parameters.

  • This interpretability property is motivated by Occam’s razor: the simplest explanation is likely closest to the truth.

  • Sparsity enhances interpretability only when it correctly captures the structure of the true data-generating process. If the true relationship depends on many features, imposing sparsity introduces bias.
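As a hedged illustration (simulated data, illustrative penalty and coefficient values), an L1-penalized fit recovers a sparse coefficient vector when the true process really does use only a few features:

```python
import numpy as np
from sklearn.linear_model import Lasso

# Simulated data: 20 candidate features, but the true data-generating
# process uses only features 0, 3, and 7.
rng = np.random.default_rng(0)
n, p = 200, 20
X = rng.normal(size=(n, p))
y = 2 * X[:, 0] - 1.5 * X[:, 3] + 0.5 * X[:, 7] + rng.normal(scale=0.1, size=n)

# The L1 penalty drives irrelevant coefficients exactly to zero,
# yielding a sparse (and hence more interpretable) model.
model = Lasso(alpha=0.1).fit(X, y)
nonzero = np.flatnonzero(model.coef_)
print(f"{len(nonzero)} of {p} coefficients are non-zero: features {nonzero}")
```

If the data-generating process were instead dense, the same penalty would zero out genuinely relevant features, which is the bias referred to above.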

Simulatability

A model is simulatable if a person can manually compute its output for any input within a reasonable time. Both the number of parameters and the complexity of the inference procedure factor in.
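For instance, a depth-two decision tree is simulatable: a reader can trace any input through two comparisons by hand. The thresholds below are hypothetical, chosen only for illustration:

```python
# Hypothetical depth-2 decision tree for a toy "risk" prediction.
# Simulatable: tracing any input takes two comparisons by hand.
def predict_risk(age: float, income: float) -> str:
    if age < 30:
        return "low" if income > 50_000 else "medium"
    else:
        return "medium" if income > 80_000 else "high"

print(predict_risk(age=25, income=60_000))  # age < 30, income > 50k -> "low"
```

A 100-tree forest built from the same splits would not be simulatable, even though each constituent tree is.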

Modularity

  • A model is modular if its prediction function \(f(x)\) can be decomposed into interpretable components, each of which can be analyzed independently.

  • One example is an additive decomposition, where the function can be written as,

    \[\begin{align*} f\left(x\right) = b_{0} + \sum_{j = 1}^{J}f_{j}\left(x_{j}\right) \end{align*}\]

    and each \(f_{j}\) operates on only a single coordinate of the input \(x\).

  • More generally, a model is modular if subsets of its parameters or computations can be viewed as separate, interpretable units.
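A minimal sketch of such an additive decomposition (the intercept and component functions are made up for illustration):

```python
import numpy as np

# Additive model f(x) = b0 + f_1(x_1) + f_2(x_2) + f_3(x_3).
# Each component depends on a single coordinate, so each can be
# plotted and interpreted on its own.
b0 = 1.0
components = [
    lambda x1: 2.0 * x1,         # linear effect
    lambda x2: np.sin(x2),       # smooth nonlinear effect
    lambda x3: -0.5 * x3 ** 2,   # quadratic effect
]

def f(x):
    return b0 + sum(f_j(x_j) for f_j, x_j in zip(components, x))

x = [1.0, 0.0, 2.0]
print("per-feature contributions:", [f_j(x_j) for f_j, x_j in zip(components, x)])
print("prediction:", f(x))
```

Because the contributions simply add up, each feature’s effect on the prediction can be read off independently of the others.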

Post-hoc interpretability

Some common strategies for explaining black box models include:

  • Feature importances

  • Feature attributions

  • Model distillations

A few examples are given on the next few slides, but first we should distinguish between global and local explanations.
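Of the strategies above, model distillation is perhaps the least self-explanatory, so here is a minimal sketch on simulated data (all settings illustrative): a shallow “student” tree is trained to mimic a black-box “teacher” forest.

```python
import numpy as np
from sklearn.ensemble import RandomForestClassifier
from sklearn.tree import DecisionTreeClassifier

# Simulated binary classification task.
rng = np.random.default_rng(2)
X = rng.normal(size=(500, 5))
y = (X[:, 0] + X[:, 1] ** 2 > 1).astype(int)

# "Teacher": a black-box ensemble fit for accuracy.
teacher = RandomForestClassifier(n_estimators=100, random_state=0).fit(X, y)

# "Student": a shallow, interpretable tree trained on the teacher's
# predictions rather than on the original labels.
student = DecisionTreeClassifier(max_depth=3, random_state=0)
student.fit(X, teacher.predict(X))

fidelity = (student.predict(X) == teacher.predict(X)).mean()
print(f"student matches teacher on {fidelity:.0%} of the training inputs")
```

The fidelity score quantifies how faithfully the interpretable student stands in for the black box; a low score would mean the explanation cannot be trusted.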

Global explanation

  • A global explanation characterizes how a model behaves across all possible inputs.

  • These explanations are valuable in scientific studies, where we usually look for universal rules relating sets of variables.

Example: Variable importance

According to this plot, the variables X4, X2, and X1 seem most important across three separate tree-based models (weeks 4-5).
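The plot itself is not reproduced here, but the idea behind one common importance measure, permutation importance, can be sketched on simulated data mirroring that setup (coefficients chosen so that X4, X2, and X1 matter while X3 is noise):

```python
import numpy as np
from sklearn.ensemble import RandomForestRegressor

# Simulated data in which X4, X2, and X1 drive the response and X3 is
# pure noise (values illustrative, echoing the slide's plot).
rng = np.random.default_rng(1)
X = rng.normal(size=(300, 4))
y = 3 * X[:, 3] + 2 * X[:, 1] + X[:, 0] + rng.normal(scale=0.1, size=300)

model = RandomForestRegressor(n_estimators=100, random_state=0).fit(X, y)

# Permutation importance: how much does R^2 drop when one column is shuffled?
baseline = model.score(X, y)
importances = []
for j in range(X.shape[1]):
    X_perm = X.copy()
    X_perm[:, j] = rng.permutation(X_perm[:, j])
    importances.append(baseline - model.score(X_perm, y))

print("importance of X1..X4:", np.round(importances, 3))
```

Shuffling a column breaks its relationship with the response while leaving its marginal distribution intact, so the drop in fit is a global measure of that feature’s importance.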

Local explanation

  • A local explanation describes why a model made a specific prediction for a particular input. The relationships it finds may be unique to that example.

  • These are especially helpful in auditing high-stakes decisions made in specific cases, e.g. loan approvals, medical diagnoses, parole decisions.

Example: saliency map

In vision models, one use case of local explanations is to identify the parts of an image that are “most important” for particular predicted classes.
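A model-agnostic caricature of the idea, using finite differences on a stand-in scoring function rather than a real vision network:

```python
import numpy as np

# Stand-in "model": scores an image by summing its central 2x2 patch.
# A real saliency map would instead backpropagate through a network.
def model(image: np.ndarray) -> float:
    return float(image[1:3, 1:3].sum())

# Finite-difference saliency: sensitivity of the score to each pixel.
def saliency(image: np.ndarray, eps: float = 1e-4) -> np.ndarray:
    sal = np.zeros_like(image)
    base = model(image)
    for idx in np.ndindex(image.shape):
        bumped = image.copy()
        bumped[idx] += eps
        sal[idx] = (model(bumped) - base) / eps
    return sal

print(saliency(np.zeros((4, 4))))  # only the central patch lights up
```

The map is local: it explains the score at this particular image, and a different input could highlight entirely different pixels.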

Interpretability-accuracy trade-offs

  • For many datasets, simple models (like decision trees) offer high descriptive accuracy but have lower predictive accuracy compared to more complex models (like random forests).

  • Similarly, deep neural networks may predict well but not be amenable to accurate description in the PDR sense (predictive accuracy, descriptive accuracy, relevancy) of Murdoch et al. (2019).

Bias and Variance

One intuition for this tradeoff comes from the bias-variance tradeoff: more complex models have lower bias, and when data are plentiful their extra variance matters less, so predictive accuracy tends to favor complexity.
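A quick simulation of this intuition (toy setup, all constants illustrative): fit a simple and a flexible polynomial to many resampled datasets and compare squared bias and variance at a single test point.

```python
import numpy as np

# True function and a fixed test point where we compare the two models.
rng = np.random.default_rng(0)
true_f = lambda x: np.sin(2 * np.pi * x)
x_test, n, reps = 0.25, 20, 300

preds = {1: [], 9: []}  # degree-1 (simple) vs degree-9 (flexible)
for _ in range(reps):
    x = rng.uniform(size=n)
    y = true_f(x) + rng.normal(scale=0.5, size=n)
    for degree in preds:
        coefs = np.polyfit(x, y, degree)
        preds[degree].append(np.polyval(coefs, x_test))

for degree, p in preds.items():
    p = np.array(p)
    bias2 = (p.mean() - true_f(x_test)) ** 2
    print(f"degree {degree}: squared bias = {bias2:.3f}, variance = {p.var():.3f}")
```

The straight line is badly biased at this point but stable across resamples; the degree-9 fit is nearly unbiased but its predictions vary much more from sample to sample.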

Does it exist?

Unlike the bias-variance tradeoff, which has a precise mathematical foundation, the interpretability-accuracy tradeoff is more of a heuristic, and some have argued that it can be misleading (Rudin 2019).

Workflow

If applied systematically, techniques from interpretability can be used to check assumptions, identify data and model quality issues, and uncover surprising relationships. The steps below are based on (Murdoch et al. 2019).

%%{init: {'theme':'forest', 'themeVariables': {'fontSize':'30px', 'fontFamily':'arial'}, 'flowchart': {'padding': 80}}}%%
graph LR
    A[Design] --> B[Predictive <br/> Accuracy]
    B --> C[Stability]
    C --> D[Explanation <br/> Comparisons]
    D --> E[External <br/> Checks]

Explanations for Communication

The talk (Kim 2022) describes how interpretability bridges human and machine “concepts.”

This bridge is especially important in collaborative work! Data science never exists in a vacuum.

Kim, Been. 2022. “ICLR 2022 Keynote: Been Kim.” https://youtu.be/Ub45cGEcTB0?si=DfwbpvDFJUWqeiIn.
Murdoch, W. James, Chandan Singh, Karl Kumbier, Reza Abbasi-Asl, and Bin Yu. 2019. “Definitions, Methods, and Applications in Interpretable Machine Learning.” Proceedings of the National Academy of Sciences 116 (44): 22071–80. https://doi.org/10.1073/pnas.1900654116.
Rudin, Cynthia. 2019. “Stop Explaining Black Box Machine Learning Models for High Stakes Decisions and Use Interpretable Models Instead.” Nature Machine Intelligence 1 (5): 206–15. https://doi.org/10.1038/s42256-019-0048-x.